Abstract

We discuss approximation of functions using deep neural nets. Given a function $f$ on a
$d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network
and bound its error in approximating $f$. The size of the network depends on dimension
and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description,
and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet
functions, which are computed from Rectified Linear Units (ReLU).


1 Introduction 

In the last decade, deep learning algorithms achieved unprecedented success and state-of-the-art
results in various machine learning and artificial intelligence tasks, most notably image
recognition, speech recognition, text analysis and Natural Language Processing [12]. Deep
Neural Networks (DNNs) are general in the sense of their mechanism for learning features of
the data. Nevertheless, in numerous cases, results obtained with DNNs outperformed previous
state-of-the-art methods, which often required significant domain knowledge, manifested in
hand-crafted features.

Despite the great success of DNNs in many practical applications, the theoretical framework 
of DNNs is still lacking; along with some decades-old well-known results, developing aspects 
of such theoretical framework are the focus of much recent academic attention. In particular, 
some interesting topics are (1) specification of the network topology (i.e., depth, layer sizes),
given a target function, in order to obtain certain approximation properties, (2) estimating the 
amount of training data needed in order to generalize to test data with high accuracy, and also 
(3) development of training algorithms with performance guarantees. 

1.1 The contribution of this work 

In this manuscript we discuss the first topic. Specifically, we prove a formal version of the
following result:

Theorem (informal version) 1.1. Let $\Gamma \subset \mathbb{R}^m$ be a smooth $d$-dimensional manifold, $f \in L_2(\Gamma)$,
and let $\delta > 0$ be an approximation level. Then there exists a depth-4 sparsely-connected
neural network with $N$ units, where $N = N(\delta, \Gamma, f, m)$, computing the function $f_N$ such that

    $\|f - f_N\|_2^2 \le \delta.$    (1)

The number $N = N(\delta, \Gamma, f, m)$ depends on the complexity of $f$, in terms of its wavelet
representation, the curvature and dimension of the manifold $\Gamma$, and only weakly on the ambient
dimension $m$, thus taking advantage of the possibility that $d \ll m$, which seems to be realistic
in many practical applications. Moreover, we specify the exact topology of such a network, and
show how it depends on the curvature of $\Gamma$, the complexity of $f$, and the dimensions $d$ and $m$.
Lastly, for two classes of functions we also provide approximation error rates: an $L_2$ error rate for
functions with sparse wavelet expansion and a point-wise error rate for functions in $C^2$:

• if $f$ has wavelet coefficients in $\ell_1$ then there exists a depth-4 network and a constant $c$ so
that

    $\|f - f_N\|_2^2 \le \frac{c}{N}$    (2)

• if $f \in C^2$ and has bounded Hessian, then there exists a depth-4 network so that

    $\|f - f_N\|_\infty = O(N^{-2/d}).$    (3)

1.2 The structure of this manuscript 

The structure of this manuscript is as follows: in Section 2 we review some of the fundamental
theoretical results in neural network analysis, as well as some of the recent theoretical
developments. In Section 3 we give a quick technical review of the mathematical methods and results
that are used in our construction. In Section 4 we describe our main result, namely the
construction of deep neural nets for approximating functions on smooth manifolds. In Section 5 we
specify the size of the network needed to learn a function $f$, in view of the construction of the
previous section. Section 6 concludes this manuscript.

1.3 Notation 

$\Gamma$ denotes a $d$-dimensional manifold in $\mathbb{R}^m$. $\{(U_i, \phi_i)\}$ denotes an atlas for $\Gamma$. Tangent
hyperplanes to $\Gamma$ are denoted by $H_i$. $f$ and variants of it stand for the function to be approximated.
$\varphi$, $\psi$ are scaling (aka "father") and wavelet (aka "mother") functions, respectively. The wavelet
terms are indexed by scale $k$ and offset $b$. The support of a function $f$ is denoted by supp($f$).

2 Related work 

There is a huge body of theoretical work in neural network research. In this section, we review
some classical theoretical results on neural network theory, and discuss several recent theoretical
works.

A well known result, proved independently by Cybenko [5], Hornik [10] and others, states
that Artificial Neural Networks (ANNs) with a single hidden layer of sigmoidal functions can
approximate arbitrarily closely any compactly supported continuous function. This result is
known as the "Universal Approximation Property". It does not relate, however, the number
of hidden units and the approximation accuracy; moreover, the hidden layer might contain a
very large number of units. Several works propose extensions of the universal approximation 
property (see, for example[Hl |H], for a regularization perspective and also using radial basis 
activation functions, m for all activation functions that achieve the universal approximation 
property). 

The first work to discuss the approximation error rate was done by Barron [1], who showed
that given a function $f : \mathbb{R}^m \to \mathbb{R}$ with bounded first moment of the magnitude of the Fourier
transform

    $C_f = \int_{\mathbb{R}^m} |w|\,|\tilde{f}(w)|\,dw < \infty$    (4)

there exists a neural net with a single hidden layer of $N$ sigmoid units, so that the output $f_N$
of the network satisfies

    $\|f - f_N\|_2^2 \le \frac{c_f}{N},$    (5)

where $c_f$ is proportional to $C_f$. We note that the requirement (4) gets more restrictive when the
ambient dimension $m$ is large, and that the constant $c_f$ might scale with $m$. The dependence
on $m$ is improved in [15], [16]. In particular, in [16] the constant is improved to be polynomial
in $m$. For $r$ times differentiable functions, Mhaskar [15] constructs a network with a single
hidden layer of $N$ sigmoid units (with weights that do not depend on the target function) that
achieves an approximation error rate

( 6 ) 

which is known to be optimal. This rate is also achieved (point-wise) in this manuscript,
however, with respect to the dimension $d$ of the manifold, instead of $m$, which might be a
significant difference when $d \ll m$.

During the 1990s, a popular direction in neural network research was to construct
neural networks in which the hidden units compute wavelet functions (see, for example, [20],
[18] and [21]). These works, however, do not give any specification of network architecture to
obtain desired approximation properties.

Several interesting recent theoretical results consider the representation properties of
neural nets. Eldan and Shamir [7] construct a radial function that is efficiently expressible by a
3-layer net, while requiring exponentially many units to be represented accurately by shallower
nets. In [17], Montufar et al. show that DNNs can represent more complex functions than a
shallow network with the same number of units can, where complexity is defined as the
number of linear regions of the function. Tishby and Zaslavsky [19] propose to evaluate the
representations obtained by deep networks via the information bottleneck principle, which is a
trade-off between compression of the input representation and predictive ability of the output
function; however, they do not provide any theoretical results.

A recent work by Chui and Mhaskar [3], which was brought to our attention, constructs a network with
similar functionality to the network we construct in this manuscript. In their network the low
layers map the data to local coordinates on the manifold and the upper ones approximate a
target function on each chart, however using B-splines.

3 Preliminaries 

3.1 Compact manifolds in $\mathbb{R}^m$

In this section we review the concepts of smooth manifolds, atlases and partition of unity, which
will all play important roles in our construction.

Let $\Gamma \subset \mathbb{R}^m$ be a compact $d$-dimensional manifold. We further assume that $\Gamma$ is smooth,
and that there exists $\delta > 0$ so that for all $x \in \Gamma$, $B(x, \delta) \cap \Gamma$ is diffeomorphic to a disc, with a
map that is close to the identity.

Definition 3.1. A chart for $\Gamma$ is a pair $(U, \phi)$ such that $U \subseteq \Gamma$ is open and

    $\phi : U \to M,$    (7)

where $\phi$ is a homeomorphism and $M$ is an open subset of a Euclidean space.

One way to think of a chart is as a tangent plane at some point $x \in U \subseteq \Gamma$, such that the
plane defines a Euclidean coordinate system on $U$ via the map $\phi$.

Definition 3.2. An atlas for $\Gamma$ is a collection $\{(U_i, \phi_i)\}_{i \in I}$ of charts such that $\bigcup_i U_i = \Gamma$.

Definition 3.3. Let $\Gamma$ be a smooth manifold. A partition of unity of $\Gamma$ w.r.t. an open
cover $\{U_i\}_{i \in I}$ is a family of nonnegative smooth functions $\{\eta_i\}_{i \in I}$ such that for every $x \in \Gamma$,
$\sum_i \eta_i(x) = 1$ and for every $i$, supp($\eta_i$) $\subseteq U_i$.

Theorem 3.4. (Proposition 13.9 in [14]) Let $\Gamma$ be a compact manifold and $\{U_i\}_{i \in I}$ be an open
cover of $\Gamma$. Then there exists a partition of unity $\{\eta_i\}_{i \in I}$ such that for each $i$, $\eta_i$ is in $C^\infty$, has
compact support and supp($\eta_i$) $\subseteq U_i$.

3.2 Harmonic analysis on spaces of homogeneous type 
3.2.1 Construction of wavelet frames 

In this section we cite several standard results, mostly from [6], showing how to construct a
wavelet frame of $L_2(\mathbb{R}^d)$, and discuss some of its properties.

Definition 3.5. (Definition 1.1 in [6])
A space of homogeneous type $(X, \mu, \delta)$ is a set $X$ together with a measure $\mu$ and a
quasi-metric $\delta$ (satisfies the triangle inequality up to a constant $A$) such that for every $x \in X$, $r > 0$

• $0 < \mu(B(x, r)) < \infty$
• There exists a constant $A'$ such that $\mu(B(x, 2r)) \le A' \mu(B(x, r))$

In this manuscript, we are interested in constructing a wavelet frame on $\mathbb{R}^d$, which, equipped
with Lebesgue measure and the Euclidean metric, is a space of homogeneous type.

Definition 3.6. (Definition 3.14 in [6])
Let $(X, \mu, \delta)$ be a space of homogeneous type. A family of functions $\{S_k\}_{k \in \mathbb{Z}}$, $S_k : X \times X \to \mathbb{C}$,
is said to be a family of averaging kernels ("father functions") if conditions 3.14-3.18 and
3.19 with $\sigma = \varepsilon$ in [6] are satisfied. A family $\{D_k\}_{k \in \mathbb{Z}}$, $D_k : X \times X \to \mathbb{C}$, is said to be a family
of ("mother") wavelets if for all $x, y \in X$,

    $D_k(x, y) = S_k(x, y) - S_{k-1}(x, y),$    (8)

and $S_k$, $S_{k-1}$ are averaging kernels.

By standard wavelet terminology, we denote

    $\psi_{k,b}(x) \equiv 2^{-k/2} D_k(x, b).$    (9)

Theorem 3.7. (A simplified version of Theorem 3.25 in [6])
Let $\{S_k\}$ be a family of averaging kernels. Then there exist families $\{\psi_{k,b}\}$, $\{\tilde{\psi}_{k,b}\}$ such that for
all $f \in L_2(\mathbb{R}^d)$

    $f(x) = \sum_{(k,b) \in \Lambda} \langle f, \tilde{\psi}_{k,b} \rangle \psi_{k,b}(x),$    (10)

where the functions $\psi_{k,b}$ are given by Equations (8) and (9), and $\Lambda = \{(k, b) \in \mathbb{Z} \times \mathbb{R}^d :
b \in 2^{-k/d} \mathbb{Z}^d\}$.

Remark 3.8. The kernels $\{S_k\}$ need to be such that for every $x \in \mathbb{R}^d$, $\sum_{(k,b) \in \Lambda} S_k(x, b)$ is
sufficiently large. This is discussed in great generality in chapter 3 in [6].

Remark 3.9. The functions $\tilde{\psi}_{k,b}$ are called dual elements, and are also a wavelet frame of
$L_2(\mathbb{R}^d)$.

3.3 Approximation of functions with sparse wavelet coefficients 

In this section we cite a result from [2] regarding approximating functions which have sparse 
representation with respect to a dictionary D using finite linear combinations of dictionary 
elements. 

Let $f$ be a function in some Hilbert space $\mathcal{H}$ with inner product $\langle \cdot, \cdot \rangle$ and norm $\|\cdot\|$, and let
$\mathcal{D} \subset \mathcal{H}$ be a dictionary, i.e., any family of functions $(g)_{g \in \mathcal{D}}$ with unit norm. Assume that $f$ can
be represented as a linear combination of elements in $\mathcal{D}$ with absolutely summable coefficients,
and denote the sum of absolute values of the coefficients in the expansion of $f$ by $\|f\|_{\mathcal{L}_1}$.

In [2], it is shown that $\mathcal{L}_1$ functions can be approximated using $N$ dictionary terms with
squared error proportional to $1/N$. As a bonus, we also get a greedy algorithm (though not
always practical) for selecting the corresponding dictionary terms. The Orthogonal Greedy
Algorithm (OGA) is a greedy algorithm that at the $k$'th iteration computes the residual

    $r_{k-1} := f - f_{k-1},$    (11)

finds the dictionary element that is most correlated with it

    $g_k \in \arg\max_{g \in \mathcal{D}} |\langle r_{k-1}, g \rangle|$    (12)

and defines a new approximation

    $f_k := P_k f,$    (13)

where $P_k$ is the orthogonal projection operator onto span$\{g_1, \ldots, g_k\}$.

Theorem 3.10. (Theorem 2.1 from [2]) The error $r_N$ of the OGA satisfies

    $\|f - f_N\| \le \|f\|_{\mathcal{L}_1} (N + 1)^{-1/2}.$    (14)
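
The three OGA steps in Equations (11)-(13) can be illustrated on a finite dictionary of unit vectors in $\mathbb{R}^n$. The following is only a minimal sketch under that finite-dimensional assumption, not the setting of [2]; the function names (`oga`, `dot`, `norm`) and the use of Gram-Schmidt to realize the projection $P_k$ are choices made for this illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def oga(f, dictionary, n_terms):
    """Orthogonal Greedy Algorithm for a vector f and unit-norm dictionary vectors.

    Returns the final approximation and the list of errors ||f - f_k||, k = 1..n_terms.
    """
    basis = []                      # orthonormal basis of span{g_1, ..., g_k}
    approx = [0.0] * len(f)         # f_0 = 0
    errors = []
    for _ in range(n_terms):
        # Equation (11): residual r_{k-1} = f - f_{k-1}
        residual = [a - b for a, b in zip(f, approx)]
        # Equation (12): dictionary element most correlated with the residual
        g = max(dictionary, key=lambda h: abs(dot(residual, h)))
        # Gram-Schmidt step: orthogonalize g against the elements chosen so far
        q = list(g)
        for e in basis:
            c = dot(q, e)
            q = [a - c * b for a, b in zip(q, e)]
        q_norm = norm(q)
        if q_norm > 1e-12:          # skip if g adds no new direction
            basis.append([a / q_norm for a in q])
        # Equation (13): f_k = P_k f, the orthogonal projection onto the span
        approx = [0.0] * len(f)
        for e in basis:
            c = dot(f, e)
            approx = [a + c * b for a, b in zip(approx, e)]
        errors.append(norm([a - b for a, b in zip(f, approx)]))
    return approx, errors

if __name__ == "__main__":
    s = 2 ** -0.5
    D = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [s, s, 0.0], [0.0, s, s]]
    coeffs = [0.5, 0.0, -0.2, 0.3, 0.0]          # one expansion of f; its l1 norm is 1.0
    f = [sum(c * g[j] for c, g in zip(coeffs, D)) for j in range(3)]
    _, errs = oga(f, D, 3)
    for n, err in enumerate(errs, start=1):
        print(n, err, sum(abs(c) for c in coeffs) * (n + 1) ** -0.5)
```

The printed pairs compare each error with the right-hand side of (14), using the $\ell_1$ norm of the particular expansion chosen here as an upper bound on $\|f\|_{\mathcal{L}_1}$.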

Clearly, for $\mathcal{H} = L_2(\mathbb{R}^d)$ we can choose the dictionary to be the wavelet frame given by

    $\mathcal{D} = \{\psi_{k,b} : (k, b) \in \mathbb{Z} \times \mathbb{R}^d,\ b \in 2^{-k/d} \mathbb{Z}^d\}.$    (15)

Remark 3.11. Let $\mathcal{D} = \{\psi_{k,b}\}$ be a wavelet frame that satisfies the regularities in conditions
3.14-3.19 in [6]. Then if a function $f$ is in $\mathcal{L}_1$ with respect to $\mathcal{D}$, it is also in $\mathcal{L}_1$ with respect to
any other wavelet frame that satisfies the same regularities. In other words, having expansion
coefficients in $\ell_1$ does not depend on the specific choice of wavelets (as long as the regularities
are satisfied). The idea behind the proof of this claim is explained in Appendix A.

Remark 3.12. Section 4.5 in [6] gives a way to check whether a function $f$ has sparse
coefficients without actually calculating the coefficients:

f€CiiS^2’^P\\f*f;k,o\\i<oo, (16) 

fcez 

i.e., one can determine if $f \in \mathcal{L}_1$ without explicitly computing its wavelet coefficients; rather,
by convolving $f$ with non-shifted wavelet terms in all scales.


4 Approximating functions on manifolds using deep neural nets 


In this section we describe in detail the steps in our construction of deep networks, which are 
designed to approximate functions on smooth manifolds. The main steps in our construction 
are the following: 


1. We construct a frame of $L_2(\mathbb{R}^d)$ in which the frame elements can be constructed from
rectified linear units (see Section 4.1).

2. Given a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct an atlas for $\Gamma$ by covering it with
open balls (see Section 4.2).

3. We use the open cover to obtain a partition of unity of $\Gamma$ and consequently represent any
function on $\Gamma$ as a sum of functions on $\mathbb{R}^d$ (see Section 4.3).

4. We show how to extend the wavelet terms in the wavelet expansion, which are defined on
$\mathbb{R}^d$, to $\mathbb{R}^m$ in a way that depends on the curvature of the manifold $\Gamma$ (see Section 4.4).


4.1 Constructing a wavelet frame from rectifier units 

In this section we show how Rectified Linear Units (ReLU) can be used to obtain a wavelet
frame of $L_2(\mathbb{R}^d)$. The construction of wavelets from rectifiers is fairly simple, and we refer to
results from Section 3.2 to show that they obtain a frame of $L_2(\mathbb{R}^d)$.

The rectifier activation function is defined on $\mathbb{R}$ as

    $\mathrm{rect}(x) = \max\{0, x\}.$    (17)

We define a trapezoid-shaped function $t : \mathbb{R} \to \mathbb{R}$ by

    $t(x) = \mathrm{rect}(x + 3) - \mathrm{rect}(x + 1) - \mathrm{rect}(x - 1) + \mathrm{rect}(x - 3).$    (18)

We then define the scaling function $\varphi : \mathbb{R}^d \to \mathbb{R}$ by

    $\varphi(x) = C_d\, \mathrm{rect}\left( \sum_{j=1}^{d} t(x_j) - 2(d - 1) \right),$    (19)

where the constant $C_d$ is such that

    $\int_{\mathbb{R}^d} \varphi(x)\, dx = 1;$    (20)

for example, $C_1 = \frac{1}{8}$, since $t$ is a trapezoid of height 2 with bases of length 6 and 2, and hence
has area 8. Following the construction in Section 3.2, we define

    $S_k(x, b) = 2^k \varphi(2^{k/d}(x - b)).$    (21)

Lemma 4.1. The family $\{S_k\}$ is a family of averaging kernels.

The proof is given in Appendix B. Next we define the ("mother") wavelet as

    $D_k(x, b) = S_k(x, b) - S_{k-1}(x, b),$    (22)

and denote

    $\psi_{k,b}(x) = 2^{-k/2} D_k(x, b),$    (23)

and

    $\psi(x) = \psi_{0,0}(x)$    (24)
    $= D_0(x, 0)$    (25)
    $= S_0(x, 0) - S_{-1}(x, 0)$    (26)
    $= \varphi(x) - 2^{-1} \varphi(2^{-1/d} x).$    (27)

Figure 1 shows the construction of $\varphi$ and $\psi$ in $\mathbb{R}^d$ for $d = 1, 2$.
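
The definitions of the trapezoid, the scaling function, the averaging kernel and the wavelet can be checked numerically. The sketch below is an illustration only: the function names and the explicit `c_d` argument are choices made here, points of $\mathbb{R}^d$ are passed as plain lists, and the value `c_d = 1/8` is the $d = 1$ normalization (the trapezoid has area 8). For other $d$ the constant would have to be computed from the normalization condition (20).

```python
def rect(x):
    """Rectifier: max{0, x}."""
    return max(0.0, x)

def trapezoid(x):
    """t(x): equals 2 on [-1, 1], decays linearly to 0 at |x| = 3."""
    return rect(x + 3) - rect(x + 1) - rect(x - 1) + rect(x - 3)

def scaling(x, c_d):
    """phi(x) = C_d * rect(sum_j t(x_j) - 2(d - 1)) for x given as a list of d coordinates."""
    d = len(x)
    return c_d * rect(sum(trapezoid(xj) for xj in x) - 2 * (d - 1))

def averaging_kernel(k, x, b, c_d):
    """S_k(x, b) = 2^k * phi(2^(k/d) * (x - b))."""
    d = len(x)
    s = 2.0 ** (k / d)
    return 2.0 ** k * scaling([s * (xj - bj) for xj, bj in zip(x, b)], c_d)

def wavelet(k, x, b, c_d):
    """psi_{k,b}(x) = 2^(-k/2) * (S_k(x, b) - S_{k-1}(x, b))."""
    d_k = averaging_kernel(k, x, b, c_d) - averaging_kernel(k - 1, x, b, c_d)
    return 2.0 ** (-k / 2) * d_k

if __name__ == "__main__":
    # Midpoint Riemann sum of phi on the real line with C_1 = 1/8
    h = 0.001
    total = sum(scaling([-4 + (i + 0.5) * h], 1 / 8) for i in range(8000)) * h
    print("integral of phi (d = 1):", total)
    print("psi_{2,0}(0.1):", wavelet(2, [0.1], [0.0], 1 / 8))
```

Each call to `trapezoid` uses four rectifiers, which is where the count of $4d$ rectifier units per scaling function comes from.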

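Remark 4.2. We can see that

    $\psi_{k,b}(x) = 2^{-k/2} D_k(x, b)$    (28)
    $= 2^{-k/2} \left( S_k(x, b) - S_{k-1}(x, b) \right)$    (29)
    $= 2^{-k/2} \left( 2^k \varphi(2^{k/d}(x - b)) - 2^{k-1} \varphi(2^{(k-1)/d}(x - b)) \right)$    (30)
    $= 2^{k/2} \left( \varphi(2^{k/d}(x - b)) - 2^{-1} \varphi(2^{(k-1)/d}(x - b)) \right)$    (31)
    $= 2^{k/2} \psi(2^{k/d}(x - b)).$    (32)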


Remark 4.3. With the above construction, $\varphi$ can be computed using a network with $4d$
rectifier units in the first layer and a single unit in the second layer. Hence every wavelet term
$\psi_{k,b}$ can be computed using $8d$ rectifier units in the first layer, 2 rectifier units in the second
layer and a single linear unit in the third layer. From this, the sum of $k$ wavelet terms can be
computed using a network with $8dk$ rectifiers in the first layer, $2k$ rectifiers in the second layer
and a single linear unit in the third layer.

From Theorem 3.7 and the above construction we then get the following lemma:

Lemma 4.4. $\{\psi_{k,b} : k \in \mathbb{Z},\ b \in 2^{-k/d} \mathbb{Z}^d\}$ is a frame of $L_2(\mathbb{R}^d)$.

Next, the following lemma uses properties of the above frame to obtain point-wise error
bounds in approximation of compactly supported functions $f \in C^2$.

Lemma 4.5. Let $f \in L_2(\mathbb{R}^d)$ be compactly supported, twice differentiable and let $\|\nabla^2 f\|_{op}$ be
bounded. Then for every $K \in \mathbb{N} \cup \{0\}$ there exists a combination $f_K$ of terms up to scale $K$ so
that for every $x \in \mathbb{R}^d$

    $|f(x) - f_K(x)| = O\left(2^{-2K/d}\right).$    (33)

The proof is given in Appendix C.

4.2 Creating an atlas 

In this section we specify the number of charts that we would like to have to obtain an atlas
for a compact $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$.

Figure 1: Top row, from left: the trapezoid function $t$, and the functions $\varphi$, $\psi$ on $\mathbb{R}$. Bottom
rows: the functions $\varphi$, $\psi$ on $\mathbb{R}^2$ from several points of view.


For our purpose here we are interested in a small atlas. We would like the size $C_\Gamma$ of such an
atlas to depend on the curvature of $\Gamma$: the lower the curvature is, the smaller is the number of
charts we will need for $\Gamma$.

Following the notation of Section 3.1, let $\delta > 0$ be such that for all $x \in \Gamma$, $B(x, \delta) \cap \Gamma$ is
diffeomorphic to a disc, with a map that is close to the identity. We then cover $\Gamma$ with balls of
radius $\delta/2$. The number of such balls that are required to cover $\Gamma$ is

    $C_\Gamma \le \left\lceil \frac{2^d\, SA(\Gamma)}{\delta^d}\, T_d \right\rceil,$    (34)

where $SA(\Gamma)$ is the surface area of $\Gamma$, and $T_d$ is the thickness of the covering (which corresponds
to by how much the balls need to overlap).

Remark 4.6. The thickness $T_d$ scales with $d$, however rather slowly: by [1], there exist coverings
with $T_d \le d \log d + 5d$. For example, in $d = 24$ there exists a covering with thickness of 7.9.
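
To get a feeling for how the atlas size behaves, the bound can be evaluated for concrete values. The sketch below assumes that Equation (34) reads $C_\Gamma \le \lceil 2^d\, SA(\Gamma)\, T_d / \delta^d \rceil$ (i.e., balls of radius $\delta/2$) and that the logarithm in Remark 4.6 is the natural one; both are assumptions of this illustration, as are the function names.

```python
import math

def thickness_bound(d):
    """Upper bound on the covering thickness quoted in Remark 4.6: d log d + 5d.

    The natural logarithm is an assumption of this sketch.
    """
    return d * math.log(d) + 5 * d

def atlas_size_bound(d, surface_area, delta, thickness):
    """Assumed reading of Equation (34): ceil(2^d * SA * T_d / delta^d)."""
    return math.ceil(2 ** d * surface_area * thickness / delta ** d)

if __name__ == "__main__":
    # Unit circle (d = 1, SA = 2*pi), delta = 0.5, thickness 1 (intervals need not overlap)
    print(atlas_size_bound(1, 2 * math.pi, 0.5, 1.0))
    # Same curve with the generic thickness bound of Remark 4.6
    print(atlas_size_bound(1, 2 * math.pi, 0.5, thickness_bound(1)))
```

The bound does not involve the ambient dimension $m$; it depends only on $d$, the surface area and $\delta$.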

A covering of $\Gamma$ by such a collection of balls defines an open cover of $\Gamma$ by

    $U_i = B(x_i, \delta) \cap \Gamma.$    (35)

Let $H_i$ denote the hyperplane tangent to $\Gamma$ at $x_i$. We can now define an atlas by
$\{(U_i, \phi_i)\}_{i=1}^{C_\Gamma}$, where $\phi_i$ is the orthogonal projection from $U_i$ onto $H_i$.

The above construction is sketched in Figure 2. Let $\tilde{\phi}_i$ be the extension of $\phi_i$ to $\mathbb{R}^m$,
i.e., the orthogonal projection onto $H_i$. The above construction has two important properties,
summarized in Lemma 4.7.

Figure 2: Construction of atlas.

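Lemma 4.7. For every $x \in U_i$,

    $\|x - \tilde{\phi}_i(x)\|_2 \le r_1 \le \frac{\delta}{2},$    (36)

and for every $x \in \Gamma \setminus U_i$ such that $\tilde{\phi}_i(x) \in \phi_i(U_i)$

    $\|x - \tilde{\phi}_i(x)\|_2 \ge r_2 = \frac{\sqrt{3}}{2} \delta.$    (37)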
4.3 Representing a function on a manifold as a sum of functions in $\mathbb{R}^d$

Let $\Gamma$ be a compact $d$-dimensional manifold in $\mathbb{R}^m$, let $f : \Gamma \to \mathbb{R}$, let $A = \{(U_i, \phi_i)\}_{i=1}^{C_\Gamma}$ be an
atlas obtained by the covering in Section 4.2, and let $\tilde{\phi}_i$ be the extension of $\phi_i$ to $\mathbb{R}^m$.

$\{U_i\}$ is an open cover of $\Gamma$, hence by Theorem 3.4 there exists a corresponding partition of
unity, i.e., a family of compactly supported $C^\infty$ functions $\{\eta_i\}_{i=1}^{C_\Gamma}$ such that

• $\eta_i : \Gamma \to [0, 1]$

• supp($\eta_i$) $\subseteq U_i$

• $\sum_i \eta_i = 1$

Let $f_i$ be defined by

    $f_i(x) = f(x) \eta_i(x),$    (38)

and observe that $\sum_i f_i = f$. We denote the image $\phi_i(U_i)$ by $I_i$. Note that $I_i \subset H_i$, i.e., $I_i$ lies
in a $d$-dimensional hyperplane $H_i$ which is isomorphic to $\mathbb{R}^d$. We define $\hat{f}_i$ on $\mathbb{R}^d$ as

    $\hat{f}_i(x) = f_i(\phi_i^{-1}(x))$ if $x \in I_i$, and $\hat{f}_i(x) = 0$ otherwise,    (39)

and observe that $\hat{f}_i$ is compactly supported. This construction gives the following lemma:

Lemma 4.8. For all $x \in \Gamma$,

    $\sum_{\{i : x \in U_i\}} \hat{f}_i(\phi_i(x)) = f(x).$    (40)

Assuming $\hat{f}_i \in L_2(\mathbb{R}^d)$, by Lemma 4.4 it has a wavelet expansion using the frame that was
constructed in Section 4.1.


4.4 Extending the wavelet terms in the approximation of $\hat{f}_i$ to $\mathbb{R}^m$

Assume that $\hat{f}_i \in L_2(\mathbb{R}^d)$ and let

    $\hat{f}_i = \sum_{(k,b)} \alpha_{k,b} \psi_{k,b},$    (41)

be its wavelet expansion, where $\alpha_{k,b} \in \mathbb{R}$ and $\psi_{k,b}$ is defined on $\mathbb{R}^d$.

We now show how to extend each $\psi_{k,b}$ to $\mathbb{R}^m$. Let's assume (for now) that the coordinate
system is such that the first $d$ coordinates are the local coordinates (i.e., the coordinates on
$H_i$) and the remaining $m - d$ coordinates are of the directions which are orthogonal to $H_i$.

Intuitively, we would like to extend the wavelet terms on $H_i$ to $\mathbb{R}^m$ so that they remain
constant until they "hit" the manifold, and then die off before they "hit" the manifold again.
By Lemma 4.7 it therefore suffices to extend each $\psi_{k,b}$ to $\mathbb{R}^m$ so that in each of the $m - d$
orthogonal directions, $\psi_{k,b}$ will be constant in $\left[ -\frac{r_1}{\sqrt{m-d}}, \frac{r_1}{\sqrt{m-d}} \right]$ and will have a support which
is contained in $\left[ -\frac{r_2}{\sqrt{m-d}}, \frac{r_2}{\sqrt{m-d}} \right]$.

Recall from Remark 4.2 that each of the wavelet terms $\psi_{k,b}$ in Equation (41) is defined on
$\mathbb{R}^d$ by

    $\psi_{k,b}(x) = 2^{k/2} \psi(2^{k/d}(x - b))$    (42)
    $= 2^{k/2} \left( \varphi(2^{k/d}(x - b)) - 2^{-1} \varphi(2^{(k-1)/d}(x - b)) \right),$    (43)

and recall that as in Equation (19), the scaling function $\varphi$ was defined on $\mathbb{R}^d$ by

    $\varphi(x) = C_d\, \mathrm{rect}\left( \sum_{j=1}^{d} t(x_j) - 2(d - 1) \right).$    (44)

We extend $\psi_{k,b}$ to $\mathbb{R}^m$ by

    $\psi_{k,b}(x) = 2^{k/2} \left( \varphi_r(2^{k/d}(x - b)) - 2^{-1} \varphi_r(2^{(k-1)/d}(x - b)) \right),$    (45)

where

    $\varphi_r(2^{k/d}(x - b)) = C_d\, \mathrm{rect}\left( \sum_{j=1}^{d} t(2^{k/d}(x_j - b_j)) + \sum_{j=d+1}^{m} t_r(x_j) - 2(m - 1) \right),$    (46)

and $t_r$ is a trapezoid function which is supported on $\left[ -\frac{r_2}{\sqrt{m-d}}, \frac{r_2}{\sqrt{m-d}} \right]$, whose top (small) base
is between $-\frac{r_1}{\sqrt{m-d}}$ and $\frac{r_1}{\sqrt{m-d}}$, and which has height 2. This definition of $\psi_{k,b}$ gives it a constant height
for distance $r_1$ from $H_i$, and then a linear decay, until it vanishes at distance $r_2$. Then by
construction we obtain the following lemma:


Lemma 4.9. For every chart $(U_i, \phi_i)$ and every $x \in \Gamma \setminus U_i$ such that $\tilde{\phi}_i(x) \in \phi_i(U_i)$, $x$ is
outside the support of every wavelet term corresponding to the $i$'th chart.

Remark 4.10. Since the $m - d$ additional trapezoids in Equation (46) do not scale with $k$ and
shift with $b$, they can be shared across all scaling terms in Equations (45) and (41), so that the
extension of the wavelet terms from $\mathbb{R}^d$ to $\mathbb{R}^m$ can be computed with $4(m - d)$ rectifiers.

Finally, in order for this construction to work for all $i = 1, \ldots, C_\Gamma$, the input $x \in \mathbb{R}^m$ of
the network can be first mapped to $\mathbb{R}^{m C_\Gamma}$ by a linear transformation so that each of the
$C_\Gamma$ blocks of $m$ coordinates gives the local coordinates on $\Gamma$ in the first $d$ coordinates and on
the orthogonal subspace in the remaining $m - d$ coordinates. These maps are essentially the
orthogonal projections $\tilde{\phi}_i$.


5 Specifying the required size of the network 

In the construction of Section 4, we approximate a function $f \in L_2(\Gamma)$ using a depth 4 network,
where the first layer computes the local coordinates in every chart in the atlas, the second layer
computes rect functions that are to form trapezoids, the third layer computes scaling functions
of the form $\varphi(2^{k/d}(x - b))$ for various $k, b$, and the fourth layer consists of a single node which
computes

    $\hat{f} = \sum_{i=1}^{C_\Gamma} \sum_{(k,b)} \psi^{(i)}_{k,b},$    (47)

where $\psi^{(i)}_{k,b}$ is a wavelet term on the $i$'th chart. This network is sketched in Figure 3.

Figure 3: A sketch of the network. Layer 1: $C_\Gamma$ blocks of $m$ units, giving coordinates on each
chart. Layer 2: scaled and shifted rectifiers, to form trapezoids. Layer 3: scaling terms of the
form $\varphi(2^{k/d}(x - b))$.


From this construction, we obtain the following theorem, which is the main result of this
work:

Theorem 5.1. Let $\Gamma$ be a $d$-dimensional manifold in $\mathbb{R}^m$, and let $f \in L_2(\Gamma)$. Let $\{(U_i, \phi_i)\}$
be an atlas of size $C_\Gamma$ for $\Gamma$, as in Section 4.2. Then $f$ can be approximated using a 4-layer
network with $m C_\Gamma$ linear units in the first hidden layer, $8d \sum_{i=1}^{C_\Gamma} N_i + 4 C_\Gamma (m - d)$ rectifier
units in the second hidden layer, $2 \sum_{i=1}^{C_\Gamma} N_i$ rectifier units in the third layer and a single linear
unit in the fourth (output) layer, where $N_i$ is the number of wavelet terms that are used for
approximating $f$ on the $i$'th chart.

Proof. As in Section 4.3, we construct functions $\hat{f}_i$ on $\mathbb{R}^d$ as in Equation (39), which, by Lemma
4.8, have the property that for every $x \in \Gamma$, $\sum_{\{i : x \in U_i\}} \hat{f}_i(\phi_i(x)) = f(x)$. The fact that $\hat{f}_i$ is
compactly supported means that its wavelet approximation converges to zero outside $\phi_i(U_i)$.
Together with Lemma 4.9 we then get that an approximation of $f$ is obtained by summing up
the approximations of all the $\hat{f}_i$'s.

A first layer of the network will consist of $m C_\Gamma$ linear units and will compute the map as in the
last paragraph of Section 4.4, i.e., linearly transform the input to $C_\Gamma$ blocks, each of dimension
$m$, so that in each block $i$ the first $d$ coordinates are with respect to the tangent hyperplane $H_i$
(i.e., will give the representation $\tilde{\phi}_i(x)$) and the remaining $m - d$ coordinates are with respect
to directions orthogonal to $H_i$.

For each $i = 1, \ldots, C_\Gamma$, we approximate each $\hat{f}_i$ to some desired approximation level $\delta$ using
$N_i < \infty$ wavelet terms. By Remark 4.3, $\hat{f}_i$ can be approximated using $8dN_i$ rectifiers in the
second layer, $2N_i$ rectifiers in the third layer and a single unit in the fourth layer. By Remark
4.10, on every chart the wavelet terms in all scales and shifts can be extended to $\mathbb{R}^m$ using (the
same) $4(m - d)$ rectifiers in the second layer.

Putting this together we get that to approximate $f$ one needs a 4-layer network with $m C_\Gamma$
linear units in the first hidden layer, $8d \sum_{i=1}^{C_\Gamma} N_i + 4 C_\Gamma (m - d)$ rectifier units in the second
hidden layer, $2 \sum_{i=1}^{C_\Gamma} N_i$ rectifier units in the third layer and a single linear unit in the fourth
(output) layer. □
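
The unit counts in Theorem 5.1 are simple enough to tabulate. The following sketch computes the layer widths from $m$, $d$ and the per-chart term counts $N_i$, and compares the total with the coarser count $c_1 + c_2 N$ obtained when every $N_i$ is replaced by $N = \max_i N_i$. The function names and the example numbers are hypothetical and serve only as an illustration.

```python
def network_size(m, d, terms_per_chart):
    """Layer widths from Theorem 5.1; terms_per_chart[i] is N_i, its length is C_Gamma."""
    c_gamma = len(terms_per_chart)
    total_terms = sum(terms_per_chart)
    return {
        "layer1_linear": m * c_gamma,
        "layer2_rect": 8 * d * total_terms + 4 * c_gamma * (m - d),
        "layer3_rect": 2 * total_terms,
        "layer4_linear": 1,
    }

def size_upper_bound(m, d, c_gamma, n_max):
    """c1 + c2 * N with c1 = C_Gamma (m + 4(m - d)) + 1 and c2 = (8d + 2) C_Gamma."""
    c1 = c_gamma * (m + 4 * (m - d)) + 1
    c2 = (8 * d + 2) * c_gamma
    return c1 + c2 * n_max

if __name__ == "__main__":
    # Hypothetical example: ambient dimension 100, a 2-dimensional manifold, 3 charts
    sizes = network_size(m=100, d=2, terms_per_chart=[10, 20, 30])
    print(sizes, sum(sizes.values()))
    print(size_upper_bound(m=100, d=2, c_gamma=3, n_max=30))
```

In this count the ambient dimension $m$ enters only through terms proportional to $C_\Gamma$, while the number of wavelet terms is multiplied only by factors depending on $d$.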


Remark 5.2. For sufficiently small radius $\delta$ in the sense of Section 3.1, the desired properties
of $\hat{f}_i$ (i.e., being in $L_2$ and possibly having sparse coefficients or being twice differentiable) imply
similar properties of $f$.

Remark 5.3. We observe that the dependence on the dimension $m$ of the ambient space in
the first and second layers is through $C_\Gamma$, which depends on the curvature of the manifold.
The number $N_i$ of wavelet terms in the $i$'th chart affects the number of units in the second
layer only through the dimension $d$ of the manifold, not through $m$. The sizes of the third and
fourth layers do not depend on $m$ at all.

Finally, assuming regularity conditions on the $\hat{f}_i$ allows us to bound the number $N_i$ of
wavelet terms needed for the approximation of $\hat{f}_i$. In particular, we consider two specific cases:
$\hat{f}_i \in \mathcal{L}_1$ and $\hat{f}_i \in C^2$, with bounded second derivative.


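Corollary 5.4. If $\hat{f}_i \in \mathcal{L}_1$ (i.e., $\hat{f}_i$ has expansion coefficients in $\ell_1$), then by Theorem 3.10, $\hat{f}_i$
can be approximated by a combination $\hat{f}_{i,N_i}$ of $N_i$ wavelet terms so that

    $\|\hat{f}_i - \hat{f}_{i,N_i}\|_2^2 \le \frac{\|\hat{f}_i\|_{\mathcal{L}_1}^2}{N_i + 1}.$    (48)

Consequently, denoting the output of the net by $\hat{f}$, $N = \max_i \{N_i\}$ and $M = \max_i \|\hat{f}_i\|_{\mathcal{L}_1}$,
we obtain

    $\|f - \hat{f}\|_2^2 \le \frac{C_\Gamma M}{N + 1},$    (49)

using $c_1 + c_2 N$ units, where $c_1 = C_\Gamma (m + 4(m - d)) + 1$ and $c_2 = (8d + 2) C_\Gamma$.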

Corollary 5.5. If each $\hat{f}_i$ is twice differentiable and $\|\nabla^2 \hat{f}_i\|_{op}$ is bounded, then by Lemma
4.5 $\hat{f}_i$ can be approximated by $\hat{f}_{i,K}$ using all terms up to scale $K$ so that for every $x \in \mathbb{R}^d$

    $|\hat{f}_i(x) - \hat{f}_{i,K}(x)| = O\left(2^{-2K/d}\right).$    (50)

Observe that the grid spacing in the $k$'th level is $2^{-k/d}$. Therefore, since $\hat{f}_i$ is compactly
supported, there are $O\left((2^{k/d})^d\right) = O(2^k)$ terms in the $k$'th level. Altogether, on the $i$'th chart
there are $O(2^{K+1})$ terms in levels up to $K$. Writing $N = 2^{K+1}$, we get a point-wise error
rate of $N^{-2/d}$ using $c_1 + c_2 N$ units, where $c_1 = C_\Gamma (m + 4(m - d)) + 1$ and $c_2 = (8d + 2) C_\Gamma$.
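
The step from the scale $K$ to the rate in terms of $N$ can be written out explicitly. The short derivation below is only a sketch: it assumes that the levels are indexed $k = 0, 1, \ldots, K$ (as in Lemma 4.5) and it suppresses all chart-dependent constants.

```latex
\begin{align*}
\#\{\text{terms in levels } 0, \dots, K\}
  &= O\Big(\sum_{k=0}^{K} 2^{k}\Big) = O\big(2^{K+1} - 1\big) = O\big(2^{K+1}\big), \\
N = 2^{K+1} \;&\Longrightarrow\; 2^{K} = N/2, \\
2^{-2K/d} &= (N/2)^{-2/d} = 2^{2/d}\, N^{-2/d} = O\big(N^{-2/d}\big).
\end{align*}
```

Since $2^{2/d} \le 4$ for every $d \ge 1$, the constant hidden in the last step does not grow with the dimension.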


Remark 5.6. The unit count in Theorem 5.1 and Corollaries 5.4 and 5.5 is overly pessimistic,
in the sense that we assume that the sets of wavelet terms in the expansion of $\hat{f}_i$, $\hat{f}_j$ do not
intersect, where $i, j$ are chart indices. A tighter bound can be obtained if we allow wavelet
functions to be shared across different charts, in which case the term $\sum_{i=1}^{C_\Gamma} N_i$ in Theorem 5.1 can
be replaced by the total number of distinct wavelet terms that are used on all charts, hence
decreasing the constant $c_2$. In particular, in Corollary 5.5 we are using all terms up to the $K$'th
scale on each chart. In this case the constant $c_2 = 8d + 2$.


Remark 5.7. The linear units in the first layer can be simulated using ReLU units with large 
positive biases, and adjusting the biases of the units in the second layer. Hence the first layer 
can contain ReLU units instead of linear units. 


6 Conclusions 

The construction presented in this manuscript can be divided into two main parts: analytical
and topological. In the analytical part, we constructed a wavelet frame of $L_2(\mathbb{R}^d)$, where the
wavelets are computed from Rectified Linear Units. In the topological part, given training data
on a $d$-dimensional manifold $\Gamma$ we constructed an atlas and represented any function on $\Gamma$ as a
sum of functions that are defined on the charts. We then used rectifier units to extend the
wavelet approximation of the functions from $\mathbb{R}^d$ to the ambient space $\mathbb{R}^m$. This construction
allows us to state the size of a depth 4 neural net given a function $f$ to be approximated on
the manifold $\Gamma$. We show how the specified size depends on the complexity of the function
(manifested in the number of wavelet terms in its approximation) and the curvature of the
manifold (manifested in the size of the atlas). In particular, we take advantage of the fact that
$d$ can possibly be much smaller than $m$ to construct a network with size that depends more
strongly on $d$. In addition, we also obtain a squared error rate in approximation of functions
with sparse wavelet expansion and a point-wise error rate for twice differentiable functions.

The network architecture and corresponding weights presented in this manuscript are
hand-made, and are such that the network achieves the approximation properties stated above.
However, it is reasonable to assume that such a network is unlikely to be the result of a standard
training process. Hence, we see the importance of the results presented in this manuscript in describing
the theoretical approximation capability of neural nets, and not in describing trained nets
which are used in practice.

Several extensions of this work can be considered. First, a more efficient wavelet
representation can be obtained on each chart if one allows its wavelets to be non-isotropic (that is, to
scale differently in every dimension) and not necessarily axis aligned, but rather, to correspond
to the level sets of the function being approximated. When the function is relatively constant
in certain directions, the wavelet terms can be "stretched" in these directions. This can
be done using curvelets.

Second, we conjecture that in the representation obtained as an output of convolutional and
pooling layers, the data concentrates near a collection of low dimensional manifolds embedded
in a high dimensional space, which is our starting point in the current manuscript. We think
that this is a result of the application of the same filters to all data points. Assuming our
conjecture is true, one can apply our construction to the output of convolutional layers, and by
that obtain a network topology which is similar to standard convolutional networks, namely
fully connected layers on top of convolutional ones. This will make our arguments here applicable
to cases where the data in its initial representation does not concentrate near a low dimensional
manifold, but its hidden representation does.

Finally, we remark that the choice of using rectifier units to construct our wavelet frame
is convenient, however somewhat arbitrary. Similar wavelet frames can be constructed by any
function (or combination of functions) that can be used to construct "bump" functions, i.e.,
functions which are localized and have fast decay. For example, general sigmoid functions
$\sigma : \mathbb{R} \to \mathbb{R}$, which are monotonic and have the properties

\[ \lim_{x \to -\infty} \sigma(x) = 0 \quad \text{and} \quad \lim_{x \to \infty} \sigma(x) = 1, \tag{51} \]

can be used to construct a frame in a similar way, by computing "smooth" trapezoids. Recall also
that by Remark 3.11, any two such frames are equivalent.
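
The remark about "smooth" trapezoids can be illustrated with a minimal sketch. The ReLU trapezoid below (ramps on [-3, -1] and [1, 3], plateau of height 2) and its sigmoid counterpart (a difference of two shifted sigmoids, with an arbitrary `sharpness` parameter) are illustrative choices made here, not necessarily the exact functions used in the construction.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_trapezoid(x):
    # Piecewise-linear bump from four ReLUs:
    # rises on [-3, -1], equals 2 on [-1, 1], falls on [1, 3], zero elsewhere.
    return relu(x + 3) - relu(x + 1) - relu(x - 1) + relu(x - 3)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_trapezoid(x, sharpness=4.0):
    # "Smooth" bump from two sigmoids centered at the midpoints (+-2) of the ReLU ramps.
    # `sharpness` is a hypothetical knob controlling how steep the sides are.
    return 2.0 * (sigmoid(sharpness * (x + 2)) - sigmoid(sharpness * (x - 2)))

x = np.linspace(-10.0, 10.0, 2001)
hard = relu_trapezoid(x)   # compactly supported
soft = sigmoid_trapezoid(x)  # localized, with exponentially decaying tails
```

Both bumps are localized around the origin with a plateau near 2; the sigmoid version is not compactly supported, but its tails decay exponentially, which is the kind of fast decay the remark asks for.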
A Equivalence of representations in different wavelet frames

Consider two frames $\{\psi_{k,b}\}$ and $\{\psi'_{k,b}\}$. Any element $\psi'_{k',b'}$ can be
represented as

\[ \psi'_{k',b'} = \sum_{k,b} \langle \psi'_{k',b'}, \tilde{\psi}_{k,b} \rangle \, \psi_{k,b}. \tag{52} \]

Observe that in case $k \approx k'$, the inner product is of large magnitude only for a small
number of $b$'s. In case $k \ll k'$ or $k \gg k'$, the inner product is between a peaked function
which integrates to zero and a flat function, hence has small magnitude. This idea is formalized
in a more general form in Section 4.7 in [6].
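
The "peaked against flat" argument can be seen numerically in one dimension. The sketch below is only an illustration under stated assumptions: `phi` is a hypothetical ReLU trapezoid normalized to unit integral, `psi_kb` is the corresponding difference-of-scales wavelet, and the inner products are taken between members of the same family rather than against a dual frame.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def phi(x):
    # Hypothetical 1-D scaling function: ReLU trapezoid with unit integral.
    return (relu(x + 3) - relu(x + 1) - relu(x - 1) + relu(x - 3)) / 8.0

def psi_kb(x, k, b=0.0):
    # 1-D wavelet at scale k and shift b:
    # 2^{k/2} * (phi(2^k (x - b)) - 0.5 * phi(2^{k-1} (x - b))).
    u = (2.0 ** k) * (x - b)
    return 2.0 ** (k / 2.0) * (phi(u) - 0.5 * phi(u / 2.0))

x = np.linspace(-8.0, 8.0, 160001)  # covers the support of every wavelet used below
dx = x[1] - x[0]

def inner(k1, k2):
    # Riemann-sum approximation of the L2 inner product <psi_{k1,0}, psi_{k2,0}>.
    return float(np.sum(psi_kb(x, k1) * psi_kb(x, k2)) * dx)

# Inner products of the scale-0 wavelet with wavelets at finer scales k = 0, ..., 5.
inner_products = {k: inner(k, 0) for k in range(0, 6)}
```

In this particular example the scale-0 wavelet is constant on [-1, 1], so for k >= 3 the finer wavelet (supported in [-0.75, 0.75], with zero integral) yields an inner product that vanishes up to quadrature error, while the k = 0 entry is the largest.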

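B Proof of Lemma 4.1

Proof. In order to show that the family $\{S_k\}$ defined in the main text is a valid family of
averaging kernel functions, we need to verify that conditions 3.14-3.19 in [6] are satisfied.
Here $\rho(x, b)$ is the volume of the smallest Euclidean ball which contains $x$ and $b$, namely
$\rho(x, b) = c\|x - b\|^d$, for some constant $c$. Our goal is to show that there exist constants
$C < \infty$, $\sigma > 0$ and $\epsilon > 0$ such that for every $k \in \mathbb{Z}$ and
$x, x', b, b' \in \mathbb{R}^d$:

• 3.14:

\[ S_k(x, b) \le C \frac{2^{-k\epsilon}}{\left(2^{-k} + \rho(x, b)\right)^{1+\epsilon}}. \tag{53} \]

Proof. WLOG we can assume $b = 0$, and let $\epsilon$ be an arbitrary positive number. It can be
easily verified that there exists a constant $C'$ such that

\[ \varphi(x) \le \frac{C'}{\left(c^{-1} + \|x\|^d\right)^{1+\epsilon}}. \tag{54} \]

Then

\begin{align}
S_k(x, 0) &= 2^k \varphi\left(2^{k/d} x\right) \tag{55} \\
&\le C' \frac{2^k}{\left(c^{-1} + 2^k \|x\|^d\right)^{1+\epsilon}} \tag{56} \\
&= C' \frac{2^{k(1+\epsilon)} 2^{-k\epsilon}}{\left(c^{-1} + 2^k \|x\|^d\right)^{1+\epsilon}} \tag{57} \\
&= C' \frac{2^{-k\epsilon}}{\left(c^{-1} 2^{-k} + \|x\|^d\right)^{1+\epsilon}} \tag{58} \\
&= C_1 \frac{2^{-k\epsilon}}{\left(2^{-k} + \rho(x, 0)\right)^{1+\epsilon}}, \tag{59}
\end{align}

where $C_1 = C' c^{1+\epsilon}$. □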


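• 3.15, 3.16: Since $S_k(x, b)$ depends only on $x - b$ and is symmetric about the origin, it
suffices to prove only 3.15. We want to show that if
$\rho(x, x') \le \frac{1}{2A}\left(2^{-k} + \rho(x, b)\right)$ then

\[ |S_k(x, b) - S_k(x', b)| \le C \left( \frac{\rho(x, x')}{2^{-k} + \rho(x, b)} \right)^{\sigma} \frac{2^{-k\epsilon}}{\left(2^{-k} + \rho(x, b)\right)^{1+\epsilon}}. \tag{60} \]

Proof. WLOG $b = 0$; we will prove for every $x, x'$. Let $\epsilon$ be an arbitrary positive
number, and let $\sigma = \frac{1}{d}$, so that $\rho(x, x')^{\sigma} = c^{\sigma}\|x - x'\|$. By the
mean value theorem we get

\[ \frac{|S_k(x, 0) - S_k(x', 0)|}{\rho(x, x')^{\sigma}} \le \max_{z \text{ between } x, x'} \frac{1}{c^{\sigma}} \left\| \nabla_x \left( S_k(z, 0) \right) \right\|. \tag{61} \]

Denote

\[ F(x) = \left\| \nabla_x \left( S_0(x, 0) \right) \right\|. \tag{62} \]

Then

\[ \left\| \nabla_x \left( S_k(x, 0) \right) \right\| = 2^k 2^{k/d} F\left(2^{k/d} x\right). \tag{63} \]

As in the proof of condition 3.14, it can be easily verified that there exists a constant $C'$
such that

\[ F(x) \le C' \frac{1}{\left(c^{-1} + \|x\|^d\right)^{\sigma}} \cdot \frac{1}{\left(c^{-1} + \|x\|^d\right)^{1+\epsilon}}. \tag{64} \]

We then get

\begin{align}
\frac{|S_k(x, 0) - S_k(x', 0)|}{\rho(x, x')^{\sigma}}
&\le \frac{1}{c^{\sigma}} \left\| \nabla_x \left( S_k(x, 0) \right) \right\| \tag{65} \\
&= \frac{1}{c^{\sigma}} 2^k 2^{k/d} F\left(2^{k/d} x\right) \tag{66} \\
&\le \frac{C'}{c^{\sigma}} \cdot \frac{2^k 2^{k/d}}{\left(c^{-1} + 2^k \|x\|^d\right)^{\sigma} \left(c^{-1} + 2^k \|x\|^d\right)^{1+\epsilon}} \tag{67} \\
&= \frac{C'}{c^{\sigma}} \cdot \frac{2^{k/d}}{\left(c^{-1} + 2^k \|x\|^d\right)^{\sigma}} \cdot \frac{2^k}{\left(c^{-1} + 2^k \|x\|^d\right)^{1+\epsilon}} \tag{68} \\
&= \frac{C'}{c^{\sigma}} \cdot \frac{1}{\left(c^{-1} 2^{-k} + \|x\|^d\right)^{\sigma}} \cdot \frac{2^{-k\epsilon}}{\left(c^{-1} 2^{-k} + \|x\|^d\right)^{1+\epsilon}} \tag{69} \\
&= C_2 \frac{1}{\left(2^{-k} + \rho(x, 0)\right)^{\sigma}} \cdot \frac{2^{-k\epsilon}}{\left(2^{-k} + \rho(x, 0)\right)^{1+\epsilon}}, \tag{70}
\end{align}

where $C_2 = C' c^{1+\epsilon}$. □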


• 3.17, 3.18: Since $S_k(x, b)$ depends only on $x - b$ and is symmetric about the origin, it
suffices to prove only 3.17.

Proof. By Equation (19),

\[ \int_{\mathbb{R}^d} \varphi(x) \, dx = 1, \tag{71} \]

and consequently for every $k \in \mathbb{Z}$ and $b \in \mathbb{R}^d$,

\[ \int_{\mathbb{R}^d} S_k(x, b) \, dx = 1. \tag{72} \]

□


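• 3.19: We want to show that if $\rho(x, x') \le \frac{1}{2A}\left(2^{-k} + \rho(x, b)\right)$ and
$\rho(b, b') \le \frac{1}{2A}\left(2^{-k} + \rho(x, b)\right)$ then

\begin{align}
&|S_k(x, b) - S_k(x', b) - S_k(x, b') + S_k(x', b')| \tag{73} \\
&\quad \le C \left( \frac{\rho(x, x')}{2^{-k} + \rho(x, b)} \right)^{\sigma} \left( \frac{\rho(b, b')}{2^{-k} + \rho(x, b)} \right)^{\sigma} \frac{2^{-k\epsilon}}{\left(2^{-k} + \rho(x, b)\right)^{1+\epsilon}}. \tag{74}
\end{align}

Proof. We will prove for all $x, x', b, b'$. Let $\sigma = \frac{1}{d}$. Observe that

\begin{align}
&\frac{|S_k(x, b) - S_k(x', b) - S_k(x, b') + S_k(x', b')|}{\rho(x, x')^{\sigma} \rho(b, b')^{\sigma}} \tag{75} \\
&\quad = \frac{1}{\rho(b, b')^{\sigma}} \left| \frac{S_k(x, b) - S_k(x', b)}{\rho(x, x')^{\sigma}} - \frac{S_k(x, b') - S_k(x', b')}{\rho(x, x')^{\sigma}} \right|. \tag{76}
\end{align}

Denote

\[ F(b) = \frac{S_k(x, b) - S_k(x', b)}{\rho(x, x')^{\sigma}}. \tag{78} \]

Then by applying the mean value theorem twice we get

\begin{align}
\frac{1}{\rho(b, b')^{\sigma}} \left| \frac{S_k(x, b) - S_k(x', b)}{\rho(x, x')^{\sigma}} - \frac{S_k(x, b') - S_k(x', b')}{\rho(x, x')^{\sigma}} \right|
&= \frac{|F(b) - F(b')|}{\rho(b, b')^{\sigma}} \tag{80} \\
&\le \frac{1}{c^{\sigma}} \max_{z \text{ between } b, b'} \left\| \nabla_b \left( F(z) \right) \right\| \tag{81} \\
&= \frac{1}{c^{\sigma}} \max_{z \text{ between } b, b'} \left\| \nabla_b \left( \frac{S_k(x, z) - S_k(x', z)}{\rho(x, x')^{\sigma}} \right) \right\| \tag{82} \\
&\le \frac{1}{c^{2\sigma}} \max_{z \text{ between } b, b'} \; \max_{z' \text{ between } x, x'} \left\| \nabla_b \nabla_x \left( S_k(z', z) \right) \right\|. \tag{83}
\end{align}

From this, and since $S_k$ is compactly supported and bounded, there exists a compactly
supported function $\xi(x)$ such that

\[ \frac{|S_0(x, b) - S_0(x', b) - S_0(x, b') + S_0(x', b')|}{\rho(x, x')^{\sigma} \rho(b, b')^{\sigma}} \le \xi(x - b) + \xi(x - b'), \tag{85} \]

and consequently

\[ \frac{|S_k(x, b) - S_k(x', b) - S_k(x, b') + S_k(x', b')|}{\rho(x, x')^{\sigma} \rho(b, b')^{\sigma}} \le 2^k 2^{2k/d} \left( \xi\left(2^{k/d}(x - b)\right) + \xi\left(2^{k/d}(x - b')\right) \right). \tag{87} \]

As in the proof of conditions 3.14, 3.15, there exists a constant $C'$ such that

\[ \xi(x - b) + \xi(x - b') \le C' \frac{1}{\left(c^{-1} + \|x - b\|^d\right)^{2\sigma}} \cdot \frac{1}{\left(c^{-1} + \|x - b\|^d\right)^{1+\epsilon}}. \tag{88} \]

We then get

\begin{align}
&\frac{|S_k(x, b) - S_k(x', b) - S_k(x, b') + S_k(x', b')|}{\rho(x, x')^{\sigma} \rho(b, b')^{\sigma}} \tag{89} \\
&\quad \le 2^k 2^{2k/d} \left( \xi\left(2^{k/d}(x - b)\right) + \xi\left(2^{k/d}(x - b')\right) \right) \tag{90} \\
&\quad \le C' \frac{2^{2k/d}}{\left(c^{-1} + 2^k \|x - b\|^d\right)^{2\sigma}} \cdot \frac{2^k}{\left(c^{-1} + 2^k \|x - b\|^d\right)^{1+\epsilon}} \tag{91} \\
&\quad = C' \frac{1}{\left(c^{-1} 2^{-k} + \|x - b\|^d\right)^{2\sigma}} \cdot \frac{2^{-k\epsilon}}{\left(c^{-1} 2^{-k} + \|x - b\|^d\right)^{1+\epsilon}} \tag{92} \\
&\quad = C_3 \frac{1}{\left(2^{-k} + \rho(x, b)\right)^{2\sigma}} \cdot \frac{2^{-k\epsilon}}{\left(2^{-k} + \rho(x, b)\right)^{1+\epsilon}}, \tag{93}
\end{align}

where $C_3 = C' c^{2\sigma + 1 + \epsilon}$. □

Finally, we set $C = \max\{C_1, C_2, C_3\}$.

□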
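
Conditions 3.14 and 3.17 can be sanity-checked numerically in one dimension. The following is a minimal sketch under assumptions made here for illustration only: `phi` is a hypothetical ReLU trapezoid with unit integral, $d = 1$, $c = 1$ (so $\rho(x, 0) = |x|$), $\epsilon = 1$, and the crude constant $C' = 4$, which is valid because this `phi` is at most 1/4 and vanishes for $|x| > 3$.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def phi(x):
    # Hypothetical 1-D scaling function: ReLU trapezoid with unit integral.
    return (relu(x + 3) - relu(x + 1) - relu(x - 1) + relu(x - 3)) / 8.0

def S(x, k, b=0.0, d=1):
    # Averaging kernel S_k(x, b) = 2^k * phi(2^{k/d} (x - b)); d = 1 in this sketch.
    return 2.0 ** k * phi(2.0 ** (k / d) * (x - b))

c, eps = 1.0, 1.0               # assumed: rho(x, b) = c * |x - b| and epsilon = 1
C_prime = 4.0                   # phi(x) <= 4 / (1 + |x|)^2 for the trapezoid above
C1 = C_prime * c ** (1 + eps)   # constant obtained in the last step of the 3.14 proof

x = np.linspace(-40.0, 40.0, 800001)
dx = x[1] - x[0]
rho = c * np.abs(x)

bound_holds = {}
integrals = {}
for k in range(-3, 4):
    s = S(x, k)
    bound = C1 * 2.0 ** (-k * eps) / (2.0 ** (-k) + rho) ** (1 + eps)
    bound_holds[k] = bool(np.all(s <= bound))   # condition 3.14 with b = 0
    integrals[k] = float(np.sum(s) * dx)        # condition 3.17: should be close to 1
```

For every scale in the loop the pointwise bound holds and the kernel integrates to approximately 1, which mirrors the chain of equalities in the proof of 3.14 and the change of variables behind 3.17.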


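C Proof of Lemma 4.5

We first prove the following propositions.

Proposition C.1. For each $k, b$, $\psi_{k,b}$ and $\tilde{\psi}_{k,b}$ have two vanishing moments.

Proof. Note that a function $f$ on $\mathbb{R}^d$ which is symmetric about the origin satisfies

\[ \int_{\mathbb{R}^d} x f(x) \, dx = 0. \tag{94} \]

We first show that for every $(k, b) \in \Lambda$, $\psi_{k,b}$ has two vanishing moments. For each
$(k, b) \in \mathbb{Z} \times \mathbb{R}^d$,

\begin{align}
2^k \int_{\mathbb{R}^d} \varphi\left(2^{k/d}(x - b)\right) dx &= \int_{\mathbb{R}^d} \varphi(x) \, dx \tag{95} \\
&= 1, \tag{96}
\end{align}

by change of variables. This gives that for every $(k, b) \in \mathbb{Z} \times \mathbb{R}^d$

\begin{align}
\int_{\mathbb{R}^d} \psi_{k,b}(x) \, dx &= 2^{k/2} \int_{\mathbb{R}^d} \left( \varphi\left(2^{k/d}(x - b)\right) - 2^{-1} \varphi\left(2^{(k-1)/d}(x - b)\right) \right) dx \tag{97} \\
&= 0. \tag{98}
\end{align}

Hence the first moment of $\psi_{k,b}$ vanishes. Further, since $\varphi$ is symmetric about the
origin we have

\begin{align}
\int_{\mathbb{R}^d} x \, \varphi\left(2^{k/d}(x - b)\right) dx &= 2^{-k} \int_{\mathbb{R}^d} \left(2^{-k/d} y + b\right) \varphi(y) \, dy \tag{99} \\
&= 2^{-k} b \int_{\mathbb{R}^d} \varphi(y) \, dy \tag{100} \\
&= 2^{-k} b, \tag{101}
\end{align}

which gives

\begin{align}
\int_{\mathbb{R}^d} x \, \psi_{k,b}(x) \, dx &= 2^{k/2} \int_{\mathbb{R}^d} x \left( \varphi\left(2^{k/d}(x - b)\right) - 2^{-1} \varphi\left(2^{(k-1)/d}(x - b)\right) \right) dx \tag{102} \\
&= 2^{k/2} \left( 2^{-k} b - 2^{-1} 2^{-(k-1)} b \right) \tag{103} \\
&= 2^{k/2} \left( 2^{-k} b - 2^{-k} b \right) \tag{104} \\
&= 0, \tag{105}
\end{align}

hence the second moment of $\psi_{k,b}$ also vanishes.

Finally, to show that the functions $\tilde{\psi}_{k,b}$ have two vanishing moments as well, we
note that the dual functions are obtained using convolution with operators $D_k$ ([6], p. 82),
which, by the above arguments, have two vanishing moments; hence they inherit this property. □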
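
The two vanishing moments of $\psi_{k,b}$ can be checked numerically in one dimension. This is a minimal sketch assuming a hypothetical ReLU-trapezoid scaling function `phi` with unit integral and $d = 1$; the scales and shifts in the loop are arbitrary test values.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def phi(x):
    # Hypothetical 1-D scaling function: symmetric ReLU trapezoid with unit integral.
    return (relu(x + 3) - relu(x + 1) - relu(x - 1) + relu(x - 3)) / 8.0

def psi_kb(x, k, b):
    # 1-D wavelet: 2^{k/2} * (phi(2^k (x - b)) - 0.5 * phi(2^{k-1} (x - b))).
    u = (2.0 ** k) * (x - b)
    return 2.0 ** (k / 2.0) * (phi(u) - 0.5 * phi(u / 2.0))

x = np.linspace(-16.0, 16.0, 320001)  # wide enough for the coarsest wavelet below
dx = x[1] - x[0]

moments = {}
for k, b in [(-1, 0.0), (0, -0.5), (2, 1.25)]:
    w = psi_kb(x, k, b)
    zeroth = float(np.sum(w) * dx)      # integral of psi_{k,b}
    first = float(np.sum(x * w) * dx)   # integral of x * psi_{k,b}
    moments[(k, b)] = (zeroth, first)
```

Both integrals come out as zero up to quadrature error for every tested pair, including the shifted ones, in line with the change-of-variables computation in the proof.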

Proposition C.2. For every $(k, b)$, $\tilde{\psi}_{k,b}$ decays faster than any polynomial.

Proof. By [6] (p. 82), the dual functions are also wavelets, hence they satisfy condition 3.14
in [6] with $\epsilon' < \epsilon$. Since in the proof of Lemma 4.1 $\epsilon$ can be arbitrarily
large, it implies that the duals satisfy condition 3.14 with any $\epsilon$, which proves the
proposition. □

Proposition C.3. $|\psi_{k,b}| \le 2^{\frac{k}{2} - 2}$.

Proof. We note that for all $d \ge 2$ the normalizing constant $C_d$ is bounded, hence so is
$\varphi(x)$, and consequently $|\psi(x)| \le \frac{1}{4}$. Since

\[ \psi_{k,b}(x) = 2^{k/2} \psi\left(2^{k/d}(x - b)\right), \tag{106} \]

we get that $|\psi_{k,b}| \le 2^{\frac{k}{2} - 2}$. □


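Proposition C.4. If $f \in C^2$ and $\|\nabla^2 f\|_{op}$ is bounded, then the coefficients
$\langle \tilde{\psi}_{k,b}, f \rangle$ satisfy

\[ |\langle \tilde{\psi}_{k,b}, f \rangle| = O\left(2^{-\left(\frac{2k}{d} + \frac{k}{2}\right)}\right). \tag{107} \]

Proof.

\begin{align}
\langle \tilde{\psi}_{k,b}, f \rangle &= 2^{k/2} \int \tilde{\psi}\left(2^{k/d}(x - b)\right) f(x) \, dx \tag{108} \\
&= 2^{-k/2} \int_{\mathrm{supp}(\tilde{\psi})} \tilde{\psi}(y) \, f\left(2^{-k/d} y + b\right) dy, \tag{109}
\end{align}

where we have used change of variables. Since $f$ is twice differentiable, we can replace $f$
by its Taylor expansion near $b$:

\begin{align}
&\int_{\mathrm{supp}(\tilde{\psi})} \tilde{\psi}(y) \, f\left(2^{-k/d} y + b\right) dy \tag{110} \\
&\quad = \int_{\mathrm{supp}(\tilde{\psi})} \tilde{\psi}(y) \left[ f(b) + 2^{-k/d} \langle y, \nabla f(b) \rangle + O\left( \|\nabla^2 f(b)\|_{op} \left(2^{-k/d} \|y\|_2\right)^2 \right) \right] dy. \tag{111}
\end{align}

By Proposition C.1, $\tilde{\psi}$ has two vanishing moments, so the first two terms integrate to
zero; this gives

\[ |\langle \tilde{\psi}_{k,b}, f \rangle| = O\left( 2^{-\left(\frac{2k}{d} + \frac{k}{2}\right)} \|\nabla^2 f\|_{op} \int_{\mathrm{supp}(\tilde{\psi})} |\tilde{\psi}(y)| \, \|y\|_2^2 \, dy \right). \tag{112} \]

Since by Proposition C.2 $\tilde{\psi}(y)$ decays faster than any polynomial, the integral
$\int_{\mathrm{supp}(\tilde{\psi})} |\tilde{\psi}(y)| \, \|y\|_2^2 \, dy$ is some finite number. As a result,

\[ |\langle \tilde{\psi}_{k,b}, f \rangle| = O\left(2^{-\left(\frac{2k}{d} + \frac{k}{2}\right)}\right). \tag{113} \]

□

We will also use the following property:

Remark C.5. Every $x$ is in the support of at most $12^d$ wavelet terms at every scale.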
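
The decay rate of Proposition C.4 can be observed numerically. The sketch below makes several assumptions for illustration: $d = 1$, a hypothetical ReLU-trapezoid scaling function `phi`, the test function $f(x) = e^{-x^2}$, the shift $b = 0.3$, and, importantly, the primal wavelet $\psi$ in place of the dual $\tilde{\psi}$, which is not constructed here. With $d = 1$ the predicted ratio between consecutive coefficients is $2^{-(2 + 1/2)}$.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def phi(x):
    # Hypothetical 1-D scaling function: ReLU trapezoid with unit integral.
    return (relu(x + 3) - relu(x + 1) - relu(x - 1) + relu(x - 3)) / 8.0

def psi(y):
    # Mother wavelet for d = 1: phi(y) - 2^{-1} * phi(2^{-1} y), supported on [-6, 6].
    return phi(y) - 0.5 * phi(y / 2.0)

def f(x):
    # Smooth test function with bounded second derivative (an arbitrary choice).
    return np.exp(-x ** 2)

y = np.linspace(-6.0, 6.0, 12001)
dy = y[1] - y[0]
b = 0.3

def coefficient(k):
    # <psi_{k,b}, f> = 2^{-k/2} * integral of psi(y) f(2^{-k} y + b) dy,
    # i.e. the inner product after the change of variables y = 2^k (x - b).
    return 2.0 ** (-k / 2.0) * float(np.sum(psi(y) * f(2.0 ** (-k) * y + b)) * dy)

coeffs = [coefficient(k) for k in range(0, 9)]
ratios = [abs(coeffs[k + 1] / coeffs[k]) for k in range(len(coeffs) - 1)]
predicted = 2.0 ** (-(2.0 + 0.5))   # 2^{-(2/d + 1/2)} with d = 1
```

At coarse scales the ratios deviate from the prediction because the Taylor expansion is not yet accurate over the support of the wavelet; at finer scales they approach `predicted`, as the $O(\cdot)$ statement suggests.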


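We are now ready to prove Lemma 4.5.

Proof. Let $f \in L_2(\mathbb{R}^d)$, $d \le 3$, be compactly supported, twice differentiable and
with $\|\nabla^2 f\|_{op}$ bounded. $f$ can be expressed as

\[ f = \sum_{(k,b) \in \Lambda} \langle \tilde{\psi}_{k,b}, f \rangle \, \psi_{k,b}. \tag{114} \]

Approximating $f$ by $f_K$, which only consists of the wavelet terms of scales $k \le K$, we
obtain that for every $x \in \mathbb{R}^d$

\[ |f(x) - f_K(x)| \le \sum_{k=K+1}^{\infty} \; \sum_{b : (k,b) \in \Lambda} |\psi_{k,b}(x)| \, |\langle \tilde{\psi}_{k,b}, f \rangle|. \tag{115} \]

By Remark C.5, at most $12^d$ wavelet terms are supported on $x$ at every scale; by Proposition
C.3, $|\psi_{k,b}| \le 2^{\frac{k}{2} - 2}$; by Proposition C.4,
$|\langle \tilde{\psi}_{k,b}, f \rangle| = O\left(2^{-\left(\frac{2k}{d} + \frac{k}{2}\right)}\right)$. Plugging these into
Equation (115) gives

\begin{align}
|f(x) - f_K(x)| &= O\left( \sum_{k=K+1}^{\infty} 12^d \, 2^{\frac{k}{2} - 2} \, 2^{-\left(\frac{2k}{d} + \frac{k}{2}\right)} \right) \tag{116} \\
&= O\left( \sum_{k=K+1}^{\infty} 2^{-\frac{2k}{d}} \right) \tag{117} \\
&= O\left( 2^{-\frac{2K}{d}} \right). \tag{118}
\end{align}

□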
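
For completeness, the last two steps of the proof can be spelled out as a geometric series. The following worked computation is a sketch that treats the constant hidden in the $O(\cdot)$ of Proposition C.4 as 1; the true constant multiplies the result.

```latex
\begin{align*}
\sum_{k=K+1}^{\infty} 12^d \, 2^{\frac{k}{2}-2} \, 2^{-\left(\frac{2k}{d}+\frac{k}{2}\right)}
  &= \frac{12^d}{4} \sum_{k=K+1}^{\infty} 2^{-\frac{2k}{d}} \\
  &= \frac{12^d}{4} \cdot \frac{2^{-\frac{2(K+1)}{d}}}{1 - 2^{-\frac{2}{d}}} \\
  &= \frac{12^d}{4\left(2^{\frac{2}{d}} - 1\right)} \, 2^{-\frac{2K}{d}} .
\end{align*}
```

The factor $2^{k/2}$ from the bound on $|\psi_{k,b}|$ cancels the $2^{-k/2}$ in the coefficient bound, so only the ratio $2^{-2/d} < 1$ remains and the tail is dominated by a constant multiple of $2^{-2K/d}$, with a constant that depends on $d$ but not on $K$.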